Chapter 2: Measurement and Description

Survey

Here’s a link to a short survey:

https://forms.gle/jEtCucpW3W6Fjr1u9

If you want to follow along with the analysis, you can copy-paste code from here:

https://raw.githubusercontent.com/Neilblund/GVPT-201-Site/refs/heads/main/R%20code/gsheet_descriptives_pt1.R

Goals

  • Identify features of a variable

  • Recognize the level of measurement

  • Describe central tendency

  • Describe dispersion

Concepts and variables

  1. Definition: identifying concrete features of the objects or events we’re studying and the tools to measure them.

  2. Measurement: representing real-life objects and events as variables.

  3. Description: making generalizations about variables.

Variables

  • The results of the measurement process.

  • By definition a variable must vary

  • How we describe a variable will depend on how we measure it.

Variable Types

Three broad “types” of variables, each with slightly different properties (Stevens 1946):

  1. Nominal

  2. Ordinal

  3. Interval

Nominal Variables

  • Nominal variables are categorical variables with no intrinsic ordering.
  • Examples include the names of countries or people, or someone’s religion or racial/ethnic self-identification.
  • Some nominal variables are represented by numbers, but the values of those numbers are arbitrary: zip codes, jersey numbers, and telephone numbers are still nominal because there’s no ordering to them.

The region of a particular U.S. state is a nominal variable: there’s a fixed number of categories and no intrinsic ordering
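
As a rough sketch of how a nominal variable might be stored in R, the snippet below uses a factor. The `region` values are made up for illustration and are not course data.

```r
# Hypothetical region labels for six states (illustrative values only)
region <- factor(c("South", "West", "Northeast", "South", "Midwest", "West"))

# Levels are sorted alphabetically by default; the order carries no meaning
levels(region)

# Counting categories is meaningful for nominal data
table(region)

# A plain factor is not treated as ordered
is.ordered(region)
```

Because the factor is unordered, R will not let you ask whether one region is "greater than" another, which matches the idea that nominal categories have no intrinsic ordering.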

Ordinal Variables

  • Ordinal variables have a small number of categories that can be ordered.

  • However, the gaps between differing ranks may be unequal.

Top finishing times for Boston Marathon in 2024. The placements are ordinal: the distance in time between first and second place doesn’t necessarily equal the distance between second and third.

Ordinal Variables

A common source of ordinal variables will be survey items that ask people to rate their position on a scale from “strongly agree” to “strongly disagree” or “very important” to “not at all important”.

ANES question from 2020 about the importance of people agreeing about basic facts

Ordinal Variables

Another source of ordinal variables might be data that we’ve grouped into ordered categories based on another variable.

For instance: the World Bank classifies countries into four ordinal income categories based on their gross national income (GNI) per capita. There’s a clear ordering, but the gaps between categories are not equal.

Ordinal Variables

Two things to keep in mind when working with ordinal variables:

  • The “ordering” might depend partly on your research question. For example, you could flip the categories of a party identification scale, or combine them, so that they run from “most partisan” to “least partisan”.

  • Some survey variables may only be ordinal after you remove all the people who gave “don’t know” responses.
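
Both points can be sketched in R with an ordered factor. The responses and the four-category scale below are assumptions for illustration, not an actual survey item.

```r
# Hypothetical survey responses, including one "Don't know"
responses <- c("Agree", "Strongly agree", "Don't know", "Disagree",
               "Agree", "Strongly disagree")

# Treat "Don't know" as missing so the remaining categories can be ordered
responses[responses == "Don't know"] <- NA

agreement <- factor(
  responses,
  levels = c("Strongly disagree", "Disagree", "Agree", "Strongly agree"),
  ordered = TRUE
)

# Comparisons now respect the ordering: "Agree" < "Strongly agree"
agreement[1] < agreement[2]

# Reversing the levels flips the direction of the scale
flipped <- factor(responses, levels = rev(levels(agreement)), ordered = TRUE)
flipped[1] > flipped[2]
```

Which direction counts as "high" is a choice the analyst makes, so it is worth checking `levels()` before interpreting any comparison.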

Interval Variables

  • These are just numbers. They’re measured along a continuum with equal spacing (i.e., the difference between 3 and 4 is “the same” as the difference between 6 and 7)

  • Examples: age, height, temperature, distance.

Interval Variables

“True” interval variables are less common in survey research, but we’ll often treat ordinal variables as “more-or-less interval” if they have a lot of categories (7+).

Strictly speaking these “feeling thermometer” responses are more like ordinal variables. But they’re close enough for us to treat them like interval variables for most purposes.

Nuance!

Dichotomous/Dummy Variables

  • Dichotomous variables take on only two values: TRUE/FALSE, Republican/Democrat, War/Peace, etc.

  • “Dummy” variables will encode this dichotomy as 0s and 1s, which can simplify some math operations

    • Most importantly: the mean of a dummy variable = the proportion of 1s in the data!
dummy_vector <- c(1,1,1,1,0,0,0,0,0,0)

mean(dummy_vector)
[1] 0.4

Dichotomous/Dummy

Dummy coding is important for statistical modeling because it’s how we can turn nominal data into meaningful numeric data:

Who did you vote for in 2020?   Trump   Biden   Stein
Trump                           1       0       0
Biden                           0       1       0
Jill Stein                      0       0       1
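
One way to build this kind of table in R is `model.matrix()`. This is a minimal sketch with made-up vote choices, and the `vote` and `dummies` names are just for illustration.

```r
# Hypothetical vote choices for five respondents
vote <- factor(c("Trump", "Biden", "Stein", "Biden", "Trump"),
               levels = c("Trump", "Biden", "Stein"))

# One 0/1 column per category; "- 1" drops the intercept so that
# every category gets its own column
dummies <- model.matrix(~ vote - 1)
colnames(dummies) <- levels(vote)
dummies

# The mean of each dummy column is the share choosing that candidate
colMeans(dummies)
```

Each row has exactly one 1, so the rows sum to one and the column means add up to one as well.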

Index variables

Some surveys may measure an attitude by asking multiple questions on the same topic and then aggregating those responses. These aggregate indexes are often treated as interval-level variables.

The answers marked with an asterisk (*) indicate more “authoritarian” attitudes toward child rearing.

Question                              Answer                 Percent
Considerate vs. Well-behaved          Being considerate      72%
                                      Well behaved *         28%
Curiosity vs. Good Manners            Curiosity              39%
                                      Good manners *         61%
Self Reliance vs. Obedience           Obedience *            44%
                                      Self-reliance          56%
Independence vs. Respect for Elders   Independence           31%
                                      Respect for elders *   69%

The “authoritarianism” column is an index variable created by counting the number of “authoritarian” answers each respondent gave across the four questions.

authoritarianism   percent   cumulative %
0                  18.3%     18.3%
1                  17.7%     36.0%
2                  23.2%     59.2%
3                  25.1%     84.2%
4                  15.8%     100.0%
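
A count index like this can be sketched in R with `rowSums()`. The four respondents and the column names below are hypothetical, so the percentages will not match the table above.

```r
# Hypothetical answers: 1 = the more "authoritarian" option, 0 = the other
answers <- data.frame(
  well_behaved = c(1, 0, 1, 0),
  good_manners = c(1, 0, 1, 1),
  obedience    = c(1, 0, 0, 0),
  respect      = c(1, 0, 1, 1)
)

# The index is the count of authoritarian answers per respondent (0 to 4)
answers$authoritarianism <- rowSums(answers)
answers$authoritarianism

# Percent and cumulative percent at each index value
pct <- prop.table(table(answers$authoritarianism)) * 100
pct
cumsum(pct)
```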

Multiple levels of measurement

It is often possible to measure the same variable at multiple levels of measurement.

“Education”, for instance, could be recorded as interval, ordinal, or dichotomous:

Years of schooling   Highest level of schooling completed   At least some college
9                    Less than High School                  No
10                   Less than High School                  No
12                   High School                            No
13                   Some College                           Yes
14                   Some College                           Yes
15                   Some College                           Yes
16                   Bachelor’s                             Yes
17                   Post Bachelor’s                        Yes
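
Here is a sketch of how the ordinal and dichotomous versions could be derived from years of schooling in R. The breakpoints passed to `cut()` are assumptions chosen to reproduce the table, not an official coding scheme.

```r
years <- c(9, 10, 12, 13, 14, 15, 16, 17)

# Ordinal version: cut() groups years into ordered categories
# (breakpoints are assumptions chosen to match the table)
level <- cut(
  years,
  breaks = c(-Inf, 11, 12, 15, 16, Inf),
  labels = c("Less than High School", "High School", "Some College",
             "Bachelor's", "Post Bachelor's"),
  ordered_result = TRUE
)

# Dichotomous version: any schooling beyond high school
any_college <- years > 12

data.frame(years, level, any_college)
```

Going from years to categories is easy, but the reverse is impossible: once the data are collapsed, the extra precision is gone.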

Multiple levels of measurement

It’s often preferable to use the highest level of precision available, but sometimes we choose a less precise measure because it’s more parsimonious, less “noisy”, or easier to display in a table or graph.

For example: many analyses of Trump supporters will collapse education into college vs. non-college because (at least for whites) there’s a clear division between college grads and non-college grads.

Unit of analysis and levels of measurement

Keep in mind that aggregation changes the unit of analysis. “Did you vote for Trump in 2020?” is dichotomous, but Trump’s share of the vote across an entire state is continuous.
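
A small sketch of that shift in R, using made-up respondents in two states (the `voters` data frame is hypothetical):

```r
# Hypothetical respondents: 1 = voted for Trump, 0 = did not
voters <- data.frame(
  state = c("MD", "MD", "MD", "MD", "TX", "TX", "TX", "TX"),
  trump = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Individual level: a dichotomous variable
table(voters$trump)

# State level: the mean of the dummy within each state is the vote share
state_share <- tapply(voters$trump, voters$state, mean)
state_share
```

The unit of analysis in `voters` is the person, while the unit of analysis in `state_share` is the state.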

Who cares?

  • The level of measurement will be a key constraint on our choices of descriptive statistics and graphs.

  • When we get to statistical modeling, variable types will matter for the kinds of models we can use.

  • Variable types will also matter for how data are stored and analyzed in R.

Visualization

Histogram

  • Use: visualizing the distribution of interval variables.

  • Divide the range of the data into equal-width “bins” and count the number of observations in each. The height of each bar indicates the number of values in that bin.
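
The binning step can be sketched with base R's `hist()`. The `age` values and the bin width of 10 are assumptions for illustration.

```r
# Hypothetical ages (illustrative values only)
age <- c(19, 22, 23, 25, 31, 34, 35, 38, 41, 47, 52, 58)

# Bins of width 10 from 10 to 60; plot = FALSE returns the counts
# instead of drawing the figure
h <- hist(age, breaks = seq(10, 60, by = 10), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations in each bin
```

Dropping `plot = FALSE` draws the histogram itself, with one bar per bin.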

Density Plot

  • Use: visualizing the distribution of interval variables

  • Sort of a “smoothed” version of the histogram. The total area under the curve is one, and the height of the curve at a given point indicates how concentrated the data are around that value.
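
As a rough check on the "area is one" property, the sketch below estimates a density with base R's `density()` and approximates the area under it. The `age` values are made up, and the trapezoid sum is just one simple way to approximate the area.

```r
# Hypothetical ages (illustrative values only)
age <- c(19, 22, 23, 25, 31, 34, 35, 38, 41, 47, 52, 58)

# density() returns a grid of x values and the estimated curve height y
d <- density(age)

# Trapezoid-rule approximation of the area under the curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area   # close to 1
```

Calling `plot(d)` would draw the smoothed curve.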

Box plots

  • Use: visualizing the distribution of interval variables.

  • Shows the “five number summary” (minimum, 25th percentile, median, 75th percentile, maximum)

  • Especially useful for making comparisons across groups or describing multiple items with similar scales.
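
The numbers behind a box plot can be sketched with `fivenum()`. The scores and the two groups below are hypothetical. Note that `fivenum()` reports Tukey's "hinges", which are close to (but not always identical to) the 25th and 75th percentiles.

```r
# Hypothetical feeling-thermometer scores for two groups
score <- c(10, 30, 40, 50, 60, 50, 60, 70, 85, 100)
group <- rep(c("A", "B"), each = 5)

# Minimum, lower hinge, median, upper hinge, maximum for one group
fivenum(score[group == "A"])

# The same summary for every group at once
summaries <- tapply(score, group, fivenum)
summaries
```

`boxplot(score ~ group)` would draw one box per group from the same data.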

Bar plot

  • Use: visualizing the distribution of categorical variables

  • Count the frequency (or proportion) of observations in each group
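
The counting step is a one-liner in R. This sketch uses made-up party identification data.

```r
# Hypothetical party identification for ten respondents
party <- factor(c("Dem", "Rep", "Ind", "Dem", "Dem",
                  "Rep", "Ind", "Dem", "Rep", "Dem"))

counts <- table(party)         # frequency in each group
shares <- prop.table(counts)   # proportion in each group

counts
shares
```

Passing either `counts` or `shares` to `barplot()` would draw the bars.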

Visualization

               Nominal   Ordinal   Interval
Bar chart      ✓         ✓         -
Histogram      -         -         ✓
Density plot   -         -         ✓
Box plot       -         -         ✓

Describing Data

In addition to visualization, we generally want to be able to summarize and compare characteristics like:

  • Central Tendency: “typical values” of the variable

  • Dispersion: the amount of spread around the central tendency

  • Modality: the number of “peaks” or “modes” in a distribution.

  • Skewness: the amount of asymmetry in a variable.

Measures of Central Tendency

(some things you probably remember from school)

(Arithmetic) Mean

Sum up all the numbers and divide by the total number of observations

\[ \bar{x} = \frac{1}{n}\sum^n_{i=1}x_i \]

\[ \bar{x} = \text{the mean of x} \]

\[ x_i = \text{the individual values of x} \]

\[ n = \text{the number of observations} \]

(Arithmetic) Mean

A useful feature of the mean: the residuals \(x_i - \bar{x}\) always sum to zero, i.e. \(\sum^n_{i=1}(x_i - \bar{x}) = 0\)

\(x\)                 \(x-\bar{x}\)
3                     \(3 - 6 = -3\)
4                     \(4 - 6 = -2\)
6                     \(6 - 6 = 0\)
11                    \(11 - 6 = 5\)
Total: \(24\)         Total: \(-3 + -2 + 0 + 5 = 0\)
Mean: \(24/4 = 6\)

Using the mean to predict an outcome means that some predictions will be too high and others too low, but across all the observations those errors sum to exactly zero.
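
The same check takes a few lines of R, using the four values from the table (the variable names are just for illustration):

```r
x <- c(3, 4, 6, 11)

x_bar  <- mean(x)       # 24 / 4 = 6
resids <- x - x_bar     # -3 -2  0  5

sum(resids)             # 0
```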

(Arithmetic) Mean

A problematic feature of the mean is that it’s sensitive to extreme outliers, which often show up as skew in the distribution.

The average height of Victor Wembanyama (7 ft 4) and a bunch of regular people is probably misleading.

Median

For an odd number of observations, the median is the middle number of the sorted values:

\[ x = 1, 3, 3, 6, 7, 8, 9 \]

\[ \text{Median} = 6 \]

For an even number of observations, the median is the mean of the two middle values:

\[ x = 1, 3, 3,6, 7, 8, 9, 11 \]

\[ \text{Median} = 6.5 \]
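
To make the rule concrete, here is a sketch of a hand-rolled median next to R's built-in `median()`. The `manual_median()` helper is hypothetical and only meant to show the logic: take the middle value when the count is odd, and average the two middle values when it is even.

```r
odd_x  <- c(1, 3, 3, 6, 7, 8, 9)
even_x <- c(1, 3, 3, 6, 7, 8, 9, 11)

# Sort, then take the middle value (odd n) or average the
# two middle values (even n)
manual_median <- function(v) {
  v <- sort(v)
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]
  } else {
    mean(v[c(n / 2, n / 2 + 1)])
  }
}

manual_median(odd_x)    # 6
manual_median(even_x)   # 6.5
median(even_x)          # 6.5, using the built-in function
```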

Median

Importantly, the median is a skew-robust measure of central tendency.

The mean and median will be similar if there’s no skew:

\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]

\[ \text{Median of x} = 6.5 \]

\[ \text{Mean of x} = 6 \]

But they diverge when we include extreme outliers:

\[ x = 1, 3, 3, 6, 7, 8, 9, 100000000000 \]

\[ \text{Median of x} = 6.5 \]

\[ \text{Mean of x} \approx 12500000005 \]
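
The comparison is easy to reproduce in R with the same two sets of values:

```r
x         <- c(1, 3, 3, 6, 7, 8, 9, 11)
x_outlier <- c(1, 3, 3, 6, 7, 8, 9, 100000000000)

# Without the outlier the two measures are close
c(mean = mean(x), median = median(x))

# With the outlier the mean jumps by billions; the median does not move
c(mean = mean(x_outlier), median = median(x_outlier))
```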

Mode

The modal value is the value that occurs most often.

\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]

\[ \text{Mode of x = 3} \]

Unlike the mean, the mode is a valid measure of central tendency for nominal variables:

\[ \text{Tom, Earl, Tom, Sarah, Beth} \]

\[ \text{Mode = Tom} \]
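
Base R has no built-in function for the statistical mode (`mode()` reports how an object is stored), so the `stat_mode()` helper below is a hypothetical sketch. It returns the most common value as text, and returns more than one value if there is a tie.

```r
# Illustrative helper: tabulate the values and keep the most frequent one(s)
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]
}

stat_mode(c(1, 3, 3, 6, 7, 8, 9, 11))                 # "3"
stat_mode(c("Tom", "Earl", "Tom", "Sarah", "Beth"))   # "Tom"
```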

Modality

Variables may have more than one modal value. For instance, the cross-national distribution of male average years of schooling is roughly bimodal, mainly because different countries have different compulsory schooling requirements.

By contrast the % of a country’s population that is working-age is unimodal: most countries have between 65 and 70% and there’s no other value that is nearly as common.

Central Tendency

         Nominal   Ordinal   Interval
Mean     -         -         ✓
Median   -         ✓         ✓
Mode     ✓         ✓         ✓

Measures of Dispersion

Standard Deviation

The standard deviation is a … standard measure of dispersion for interval variables based on squared deviations from the mean. A larger standard deviation, all else equal, indicates that observations tend to deviate from the mean more.

Standard Deviation: Steps

To calculate the standard deviation for a sample:

1. Calculate \(\bar{x}\) (the mean of \(x\))
2. Calculate the residual (\(x_i - \bar{x}\)) for each value
3. Square each residual and add up the squares to get the total sum of squares (TSS)
4. Calculate the variance by dividing the TSS by the number of observations minus 1 (\(n - 1\))
5. Calculate the standard deviation by taking the square root of the variance

x          Deviation from mean (5)   Deviation squared
2          -3                        9
4          -1                        1
4          -1                        1
4          -1                        1
5          0                         0
5          0                         0
5          0                         0
7          2                         4
9          4                         16
Mean = 5   Total = 0                 TSS = 32

\[\text{Var(x)}=\frac{32}{(9-1)} = 4\] \[s_x = \sqrt4 = 2\]
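
The same calculation can be written out in R, one line per step, using the values from the table. The intermediate variable names are just for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

x_bar      <- mean(x)                  # mean = 5
deviations <- x - x_bar                # deviation of each value from the mean
tss        <- sum(deviations^2)        # total sum of squares = 32
variance   <- tss / (length(x) - 1)    # 32 / 8 = 4
std_dev    <- sqrt(variance)           # 2

std_dev
```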

Standard Deviation

Fortunately, we don’t have to do this by hand:

x<-c(2,4,4,4,5,5,5,7,9)

sd(x)
[1] 2

The key thing to remember is just that the standard deviation is sort of like “an average of differences from the average”

Range and IQR

  • Range is simply the difference between the lowest and highest value

  • Interquartile Range is the difference between the 25th and 75th percentiles (the first and third quartiles) of a variable

(which corresponds to the box part of a box-and-whiskers plot)
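
Both are available in base R. In this sketch, note that `range()` returns the lowest and highest values rather than their difference, and that `quantile()` uses R's default interpolation rule, so other software may give slightly different quartiles on small samples.

```r
x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

range(x)          # lowest and highest values: 2 and 9
diff(range(x))    # the range as a single number: 7

quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles: 4 and 5
IQR(x)                       # the distance between them: 1
```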

Dispersion

Nominal Ordinal Interval
Standard Deviation
IQR

Skew

Skew refers to the degree of asymmetry in data.

No Skew

When the distribution is basically symmetric, the mean and the median essentially overlap.

Right Skew

With right skew, extreme high values pull the mean higher than the median.

Left Skew

With left skew, extreme low values pull the mean lower than the median.
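
A quick way to see the direction of skew is to compare the mean and the median. The values below are made up: the left-skewed vector is simply the mirror image of the right-skewed one.

```r
# Hypothetical values (illustrative only)
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)   # one extreme high value
left_skewed  <- -right_skewed                # mirror image: one extreme low value

c(mean = mean(right_skewed), median = median(right_skewed))   # mean above median
c(mean = mean(left_skewed),  median = median(left_skewed))    # mean below median
```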

Reminders

  • Pilot design questions due Feb 7.

  • Homework 1 due Feb 13.

Notes

Stevens, S. S. 1946. “On the Theory of Scales of Measurement.” Science 103 (2684): 677–80. https://doi.org/10.1126/science.103.2684.677.